Anh Nguyen, Amira Bendjama, Hong Doan
The field of data science has experienced remarkable growth in recent years, with organizations across diverse industries recognizing the value of data-driven decision making. According to an article by 365 Data Science, the US Bureau of Labor Statistics projects that employment of data scientists will grow by 36% from 2021 to 2031, far higher than the 5% average growth rate across all occupations, indicating substantial demand for data science talent. This surging demand presents both opportunities and challenges for job seekers, particularly recent graduates. One of the significant hurdles they face is the lack of salary transparency in the data science job market. This opacity creates uncertainty regarding compensation and hinders job seekers' ability to negotiate fair salaries.
There are significant variations in data science salaries across industries and locations. For instance, according to Zippia, data scientists working in the finance and technology sectors tend to earn higher salaries than those in other industries. Geographic location also plays a crucial role: large cities with a high concentration of tech companies and high living costs, such as San Francisco and New York, offer higher salaries than smaller cities.
The discrepancies in data science salaries can also be attributed to various factors, including job responsibilities, experience level, educational background, and specific skill sets. A study conducted by Burtch Works, a leading executive recruiting firm, found that data scientists with advanced degrees, such as Ph.D., tend to command higher salaries compared to those with bachelor's or master's degrees. Similarly, professionals with expertise in specialized areas, such as machine learning or natural language processing, often earn higher salaries due to the high demand for these skills.
According to a Visier report that surveyed 1,000 US-based full-time employees, 79% of respondents want some form of pay transparency and 32% want total transparency, in which all employee salaries are publicized. However, the 2022 Pay Clarity Survey by WTW found that only 17% of companies disclose pay range information in U.S. locations where it is not required by state or local laws. In states that do have pay transparency laws, such as Colorado and New York, job postings have declined since the laws went into effect, and some employers comply by expanding the posted salary ranges, sometimes to absurd widths. These statistics highlight the lack of pay transparency not only in the field of data science, but across multiple job markets. Job seekers often struggle to estimate salaries for data science positions due to the scarcity of reliable information.
To address this problem, our project aims to develop a multiclass classification model that predicts the salary range for data science jobs. By leveraging publicly available data and employing machine learning algorithms, we seek to give job seekers a better understanding of salary expectations within the data science job market and empower them to negotiate fair and competitive compensation packages.
# Install the required packages (run once)
#install.packages("rpart.plot")
#install.packages("ggplot2")
#install.packages("e1071")
#install.packages("plotly")
# Read the first CSV file
data1 <- read.csv("ds_salaries_2023.csv")
# Read the second CSV file excluding the first column
data2 <- read.csv("ds_salaries.csv")[,-1]
# Append rows from data2 to data1
combined_data <- rbind(data2, data1)
# Write the combined data to a new CSV file
write.csv(combined_data, "combined_salaries.csv", row.names = FALSE)
library(ggplot2)
ds_salaries <- read.csv("combined_salaries.csv")
summary(ds_salaries)
work_year experience_level employment_type job_title salary salary_currency salary_in_usd employee_residence remote_ratio
Min. :2020 Length:4362 Length:4362 Length:4362 Min. : 4000 Length:4362 Min. : 2859 Length:4362 Min. : 0.0
1st Qu.:2022 Class :character Class :character Class :character 1st Qu.: 93918 Class :character 1st Qu.: 90000 Class :character 1st Qu.: 0.0
Median :2022 Mode :character Mode :character Mode :character Median : 135000 Mode :character Median :130000 Mode :character Median : 50.0
Mean :2022 Mean : 209246 Mean :134054 Mean : 49.7
3rd Qu.:2023 3rd Qu.: 180000 3rd Qu.:173000 3rd Qu.:100.0
Max. :2023 Max. :30400000 Max. :600000 Max. :100.0
company_location company_size
Length:4362 Length:4362
Class :character Class :character
Mode :character Mode :character
head(ds_salaries,5)
This data set has 4362 rows and 11 columns.
We focus on salaries in USD, so we keep the “salary_in_usd” column and drop the “salary” and “salary_currency” columns using subset().
ds_salaries <- subset(ds_salaries, select = -c( salary_currency, salary))
head(ds_salaries, 5)
# Count rows containing any missing values
num_null_rows <- sum(!complete.cases(ds_salaries))
print(num_null_rows)
[1] 0
There are no missing values in the data.
repeated_entries <- subset(ds_salaries, duplicated(ds_salaries))
print(repeated_entries)
# Remove duplicate rows
df <- ds_salaries[!duplicated(ds_salaries), ]
# check again
repeated_entries_new <- subset(df, duplicated(df))
print(repeated_entries_new)
Adding a new column to split salaries into three groups: Low, Medium, and High. The approach is to divide the dataset at percentiles: salaries below the 25th percentile are classified as “Low”, salaries between the 25th and 75th percentiles as “Medium”, and salaries above the 75th percentile as “High”.
# adding new column
# Calculate the percentiles
percentiles <- quantile(df$salary_in_usd, probs = c(0.25, 0.75))
# Define the thresholds
low_threshold <- percentiles[1] # 25th percentile
high_threshold <- percentiles[2] # 75th percentile
# Create a new column based on percentiles
df$salary_classification <- ifelse(df$salary_in_usd < low_threshold, "Low",
ifelse(df$salary_in_usd > high_threshold, "High", "Medium"))
table(df$salary_classification)
High Low Medium
644 667 1357
# Get top 10 job titles and their value counts
top10_job_title <- head(sort(table(df$job_title), decreasing = TRUE), 10)
top10_job_title_df <- data.frame(job_title = names(top10_job_title), count = as.numeric(top10_job_title))
top10_job_title_df
# Load the required packages
library(plotly)
# Define custom color palette
custom_colors <- c("#FF6361", "#FFA600", "#FFD700", "#FF76BC", "#69D2E7", "#6A0572", "#FF34B3", "#118AB2", "#FFFF99", "#FFC1CC")
# Create bar plot
fig <- plot_ly(data = top10_job_title_df, x = ~reorder(job_title, -count), y = ~count, type = "bar",
marker = list(color = custom_colors), text = ~count) %>%
layout(title = "Top 10 Job Titles", xaxis = list(title = "Job Titles"), yaxis = list(title = "Count"),
font = list(size = 17), template = "plotly_dark")
# Adjust layout settings to avoid label overlap
fig <- fig %>% layout(
margin = list(b = 150, t = 100), # Increase bottom and top margin to provide space for labels
xaxis = list(
tickangle = 45, # Rotate x-axis tick labels
automargin = TRUE # Automatically adjust margins to avoid overlap
)
)
# Display the plot
fig
Our dataset has 4 different experience categories:
- EN: Entry-level / Junior
- MI: Mid-level / Intermediate
- SE: Senior-level / Expert
- EX: Executive-level / Director
# Create a mapping of category abbreviations to full names
category_names_experience <- c("EN" = "Entry-level",
"MI" = "Mid-level",
"SE" = "Senior-level",
"EX" = "Executive-level")
# Get the sorted experience data
experience <- head(sort(table(df$experience_level), decreasing = TRUE))
# Replace the category names with full forms
names(experience) <- category_names_experience[names(experience)]
# Calculate the percentage for each category
percentages <- round(100 * experience / sum(experience), 2)
# Define a custom color palette
custom_colors <- c("#FFA998", "#FF76BC", "#69D2E7", "#FFA600")
# Create a pie chart of experience levels
pie(experience, labels = paste(names(experience), "(", percentages, "%)"), col = custom_colors, border = "white", clockwise = TRUE, init.angle = 90)
# Add a legend
legend("topright", legend = names(experience), fill = custom_colors, border = "white", cex = 0.8)
# Add a title
title("Experience Distribution", font.main = 1)
# Create a mapping of category abbreviations to full names
category_names_company <- c("M" = "Medium",
"L" = "Large",
"S" = "Small"
)
# Get the sorted company size data
company_size <- head(sort(table(df$company_size), decreasing = TRUE))
# Replace the category names with full forms
names(company_size) <- category_names_company[names(company_size)]
# Set the maximum value for the y-axis
max_count <- max(company_size)
# Create a bar plot with adjusted y-axis limits
barplot(company_size, col = custom_colors, main = "Company Size Distribution", xlab = "Company Size", ylab = "Count", ylim = c(0, max_count + 10))
# Set the scipen option to a high value
options(scipen = 10)
# Create boxplot of salaries
bp <- boxplot(df$salary_in_usd / 1000,
col = "skyblue",
main = "Boxplot of Salaries",
ylab = "Salary in Thousands USD",
notch = TRUE)
# Get the sorted salary classification data
salary_classification <- sort(table(df$salary_classification), decreasing = TRUE)
salary_classification_df <- data.frame(salary_classification = names(salary_classification), count = as.numeric(salary_classification))
fig <- plot_ly(
data = salary_classification_df,
x = ~reorder(salary_classification, -count),
y = ~count,
type = "bar",
marker = list(color = custom_colors),
text = ~count,
width = 700,
height = 400
)
fig <- fig %>% layout(
title = "Salary Classification Distribution",
xaxis = list(title = "Salary Classification"),
yaxis = list(title = "Count"),
font = list(size = 17),
template = "ggplot2"
)
fig
# Create a data frame with counts of experience levels by salary classification
experience_salary <- table(df$experience_level, df$salary_classification)
# Define custom colors for each experience level
custom_colors <- c("#69D2E7", "#1900ff", "#FF6361", "#FFD700")
# Create a data frame for the plot
plot_data <- data.frame(Experience = rownames(experience_salary),
Salary_Classification = colnames(experience_salary),
Count = as.vector(experience_salary))
# Convert Count column to numeric
plot_data$Count <- as.numeric(plot_data$Count)
# Create the bar plot
library(plotly)
fig <- plot_ly(data = plot_data, x = ~Salary_Classification, y = ~Count,
color = ~Experience, colors = custom_colors, type = "bar") %>%
layout(title = "Experience Level by Salary Classification",
xaxis = list(title = "Salary Classification"),
yaxis = list(title = "Count"),
font = list(size = 17),
template = "plotly_dark")
fig
In the feature engineering process, several modifications have been made to enhance the balance and categorization of specific columns. The following changes have been implemented:
Company Location and Employee Residence: The “company_location” and “employee_residence” variables have been updated to ensure better balance in the categories. The values in these columns have been transformed into either “US” or “Other”. This modification aims to create a more balanced representation of company and employee locations, which can improve the model’s performance.
Job Titles: The original dataset contains a wide range of job titles (95 categories), which can lead to complexity and overfitting in the model. To simplify and generalize the job titles, they have been grouped into four categories: “Data Analyst”, “Data Engineer”, “Data Scientist”, and “Other”. This categorization allows for a more concise representation of job roles, reducing the dimensionality and enhancing interpretability in the model.
To handle the categorical columns, the factor() function has been used. This function converts categorical variables into factors, R's data structure for representing categorical data. Converting the features into factors enables the models to understand and use the categorical information effectively.
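As a minimal, self-contained illustration of what factor() does (the vector below is made up for demonstration):

```r
# factor() converts a character vector into a categorical variable with levels
sizes <- factor(c("S", "M", "L", "M"), levels = c("S", "M", "L"))
levels(sizes)      # "S" "M" "L"
as.integer(sizes)  # 1 2 3 2 -- models see the underlying level codes
```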
The selected features for the model include “work_year”, “experience_level”, “employment_type”, “job_title”, “employee_residence”, “remote_ratio”, “company_location”, and “company_size”. These features provide relevant information related to work experience, level, type of employment, job title, employee and company locations, remote work ratio, and company size. The target variable for the model is the “salary_classification” column, which classifies salaries into three categories: “Low”, “Medium”, and “High”.
table(df$job_title)
3D Computer Vision Researcher AI Developer AI Programmer AI Scientist
4 11 2 16
Analytics Engineer Applied Data Scientist Applied Machine Learning Engineer Applied Machine Learning Scientist
91 10 2 13
Applied Scientist Autonomous Vehicle Technician Azure Data Engineer BI Analyst
31 2 1 9
BI Data Analyst BI Data Engineer BI Developer Big Data Architect
15 1 11 2
Big Data Engineer Business Data Analyst Business Intelligence Engineer Cloud Data Architect
11 17 4 1
Cloud Data Engineer Cloud Database Engineer Compliance Data Analyst Computer Vision Engineer
3 5 1 18
Computer Vision Software Engineer Data Analyst Data Analytics Consultant Data Analytics Engineer
5 404 2 6
Data Analytics Lead Data Analytics Manager Data Analytics Specialist Data Architect
2 18 2 64
Data DevOps Engineer Data Engineer Data Engineering Manager Data Infrastructure Engineer
1 623 5 6
Data Lead Data Management Specialist Data Manager Data Modeler
2 1 23 2
Data Operations Analyst Data Operations Engineer Data Quality Analyst Data Science Consultant
4 6 5 23
Data Science Engineer Data Science Lead Data Science Manager Data Science Tech Lead
5 8 52 1
Data Scientist Data Scientist Lead Data Specialist Data Strategist
555 2 12 2
Deep Learning Engineer Deep Learning Researcher Director of Data Engineering Director of Data Science
6 1 2 12
ETL Developer ETL Engineer Finance Data Analyst Financial Data Analyst
8 2 1 4
Head of Data Head of Data Science Head of Machine Learning Insight Analyst
11 9 2 2
Lead Data Analyst Lead Data Engineer Lead Data Scientist Lead Machine Learning Engineer
5 7 9 4
Machine Learning Developer Machine Learning Engineer Machine Learning Infrastructure Engineer Machine Learning Manager
9 214 12 3
Machine Learning Research Engineer Machine Learning Researcher Machine Learning Scientist Machine Learning Software Engineer
4 5 26 10
Manager Data Management Marketing Data Analyst Marketing Data Engineer ML Engineer
1 2 1 35
MLOps Engineer NLP Engineer Power BI Developer Principal Data Analyst
4 8 1 2
Principal Data Architect Principal Data Engineer Principal Data Scientist Principal Machine Learning Engineer
1 3 9 1
Product Data Analyst Product Data Scientist Research Engineer Research Scientist
5 1 33 67
Software Data Engineer Staff Data Analyst Staff Data Scientist
2 1 1
df$company_location <- ifelse(df$company_location == "US", "US", "Other")
df$employee_residence <- ifelse(df$employee_residence == "US", "US", "Other")
df$job_title <- ifelse(grepl("Data Science", df$job_title) | grepl("Data Scientist", df$job_title), "Data Scientist",
ifelse(grepl("Analyst", df$job_title) | grepl("Analytics", df$job_title), "Data Analyst",
ifelse(grepl("Data Engineer", df$job_title) | grepl("Data Engineering", df$job_title), "Data Engineer",
"Other")))
table(df$job_title)
Data Analyst Data Engineer Data Scientist Other
598 659 697 714
table(df$employee_residence)
Other US
768 1900
table(df$company_location)
Other US
732 1936
df <- data.frame(lapply(df, factor))
factors <- sapply(df, is.factor)
factor_cols <- names(df[factors])
factor_cols
[1] "work_year" "experience_level" "employment_type" "job_title" "salary_in_usd" "employee_residence"
[7] "remote_ratio" "company_location" "company_size" "salary_classification"
column_names <- colnames(data1)
print(column_names)
[1] "work_year" "experience_level" "employment_type" "job_title" "salary" "salary_currency" "salary_in_usd"
[8] "employee_residence" "remote_ratio" "company_location" "company_size"
set.seed(3) # Set a seed for reproducibility
train_indices <- sample(1:nrow(df), 0.9 * nrow(df)) # 90% for training
train_data <- df[train_indices, ]
test_data <- df[-train_indices, ]
# Separate the features (independent variables) from the target variable
X <- train_data[, !(names(train_data) %in% c("salary_in_usd", "salary_classification"))]
#X <- train_data[,c("experience_level","company_size","remote_ratio")]
Y <- train_data$salary_classification
library(nnet)
# Fit the multinomial logistic regression model
logistic_model <- multinom(Y ~ ., data = X)
# weights: 60 (38 variable)
initial value 2637.768105
iter 10 value 1988.010841
iter 20 value 1843.567056
iter 30 value 1827.493659
iter 40 value 1826.250743
iter 50 value 1826.149391
final value 1826.148310
converged
# Make predictions on the test data
test_data$predicted_classification <- predict(logistic_model, newdata = test_data)
# Evaluate model performance
library(caret)
confusion_matrix <- confusionMatrix(test_data$predicted_classification, test_data$salary_classification)
print(confusion_matrix)
Confusion Matrix and Statistics
Reference
Prediction High Low Medium
High 16 0 15
Low 0 47 17
Medium 45 15 112
Overall Statistics
Accuracy : 0.6554
95% CI : (0.5951, 0.7123)
No Information Rate : 0.5393
P-Value [Acc > NIR] : 0.00007773
Kappa : 0.3959
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: High Class: Low Class: Medium
Sensitivity 0.26230 0.7581 0.7778
Specificity 0.92718 0.9171 0.5122
Pos Pred Value 0.51613 0.7344 0.6512
Neg Pred Value 0.80932 0.9261 0.6632
Prevalence 0.22846 0.2322 0.5393
Detection Rate 0.05993 0.1760 0.4195
Detection Prevalence 0.11610 0.2397 0.6442
Balanced Accuracy 0.59474 0.8376 0.6450
- The accuracy of the model is reported as 0.6554, which means that approximately 65.54% of the predictions made by the model are correct.
- The model shows relatively lower sensitivity for the "High" class and higher sensitivity for the "Low" and "Medium" classes. This indicates that the model has difficulty correctly identifying instances belonging to the "High" class, while it performs better in identifying instances from the "Low" and "Medium" classes.
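To make these figures concrete, here is a small base-R sketch that recomputes the accuracy and per-class sensitivity directly from the confusion matrix above:

```r
# Confusion matrix copied from the output above (rows = predictions)
cm <- matrix(c(16, 0, 45,     # Reference "High"
               0, 47, 15,     # Reference "Low"
               15, 17, 112),  # Reference "Medium"
             nrow = 3,
             dimnames = list(Prediction = c("High", "Low", "Medium"),
                             Reference  = c("High", "Low", "Medium")))
accuracy    <- sum(diag(cm)) / sum(cm)  # correct predictions / total = 175 / 267
sensitivity <- diag(cm) / colSums(cm)   # per-class recall
round(accuracy, 4)     # 0.6554
round(sensitivity, 4)  # High 0.2623, Low 0.7581, Medium 0.7778
```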
# Load the randomForest package
library(randomForest)
randomForest 4.7-1.1
library(caret)
# Train the Random Forest classifier
rf_model <- randomForest(X, Y)
# Make predictions on new data
# Assuming you have a data frame called test_data with similar features as train_data
predictions <- predict(rf_model, test_data)
# Calculate accuracy
accuracy <- sum(predictions == test_data$salary_classification) / length(test_data$salary_classification)
cat("Accuracy:", accuracy, "\n")
Accuracy: 0.659176
# Create confusion matrix
conf_matrix <- table(predictions, test_data$salary_classification)
cat("Confusion Matrix:\n")
Confusion Matrix:
print(conf_matrix)
predictions High Low Medium
High 4 0 0
Low 1 46 18
Medium 56 16 126
# Calculate precision, recall, and F1-score for each class
class_metrics <- caret::confusionMatrix(predictions, test_data$salary_classification)
cat("Class Metrics:\n")
Class Metrics:
print(class_metrics$byClass)
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence Balanced Accuracy
Class: High 0.06557377 1.0000000 1.0000000 0.7832700 1.0000000 0.06557377 0.1230769 0.2284644 0.01498127 0.01498127 0.5327869
Class: Low 0.74193548 0.9073171 0.7076923 0.9207921 0.7076923 0.74193548 0.7244094 0.2322097 0.17228464 0.24344569 0.8246263
Class: Medium 0.87500000 0.4146341 0.6363636 0.7391304 0.6363636 0.87500000 0.7368421 0.5393258 0.47191011 0.74157303 0.6448171
- The Random Forest model achieved an accuracy of approximately 0.659 (65.92%). This means that around 65.92% of the predictions made by the model on the test data were correct.
- It demonstrates relatively higher accuracy, sensitivity, and precision for the "Low" and "Medium" classes compared to the "High" class. However, it shows lower performance in terms of sensitivity and precision for the "High" class, indicating difficulties in correctly identifying instances belonging to that class.
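As a sanity check, the F1 score for the "Low" class can be recomputed by hand from the precision and recall reported in the table above:

```r
# Precision and recall for the "Low" class, copied from the byClass table above
precision <- 0.7076923
recall    <- 0.7419355
f1 <- 2 * precision * recall / (precision + recall)  # harmonic mean of the two
round(f1, 4)  # 0.7244
```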
importance <- varImp(rf_model)
print(importance)
The variable importance measures obtained from the Random Forest model provide valuable insights into the relative contribution of each feature in predicting the salary classification. Among the features, “experience_level” and “employee_residence” stand out as the most influential variables with importance values of 100.35 and 108.70, respectively. These findings suggest that an employee’s experience level and their residence location play crucial roles in determining the salary classification. The “job_title” and “company_location” features also demonstrate notable importance, with values of 49.63 and 89.74, respectively, indicating that job title and company location significantly impact salary classification. Additionally, moderately important features such as “work_year” (25.56), “remote_ratio” (27.46), and “company_size” (26.49) contribute to the model’s predictions.
On the other hand, the “employment_type” feature exhibits a relatively lower importance value of 7.41, suggesting that it has a weaker impact on the model’s predictions compared to other variables. While the “employment_type” may have some relevance, it seems to provide less discriminatory power for salary classification in the context of the Random Forest model.
library(e1071)
# Train the SVM classifier
svm_model <- svm(Y ~ ., data = X, kernel = "radial")
# Make predictions on new data
# Assuming you have a data frame called test_data with similar features as train_data
predictions <- predict(svm_model, test_data)
# Evaluate the model
# Assuming you have the actual target variable values in test_data$salary_classification
accuracy <- sum(predictions == test_data$salary_classification) / length(test_data$salary_classification)
cat("Accuracy:", accuracy, "\n")
Accuracy: 0.6516854
# Create confusion matrix
conf_matrix <- table(predictions, test_data$salary_classification)
cat("Confusion Matrix:\n")
Confusion Matrix:
print(conf_matrix)
predictions High Low Medium
High 5 0 2
Low 2 48 21
Medium 54 14 121
library("rpart")
library("rpart.plot")
decision_tree <- rpart(Y ~ .,
data = X,
method="class")
# Make predictions on test data
predictions <- predict(decision_tree, newdata = test_data, type = "class")
# Evaluate the model
accuracy <- sum(predictions == test_data$salary_classification) / nrow(test_data)
print(paste("Accuracy:", accuracy))
[1] "Accuracy: 0.640449438202247"
rpart.plot(decision_tree)
Continuous data updating: Update the dataset with the most recent salary information by scraping salary data and incorporating additional features, such as degree level and certifications, from reliable sources or APIs. This will ensure that the dataset reflects current salary trends and provides deeper insight into the determinants of salaries.
Mitigate geographic bias by expanding data collection to include a wider range of locations, not just the U.S. Incorporating salary data from different countries or regions would provide a more comprehensive and representative view of the data science job market.
Explore advanced modeling techniques, such as neural networks and ensemble methods, to handle class imbalance.
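As one starting point for the class-imbalance idea, the base-R sketch below derives inverse-frequency class weights from the class counts shown earlier; such weights could then be passed to a learner that supports them (for example, the classwt argument of randomForest). The weighting scheme here is an illustrative assumption, not part of the original analysis.

```r
# Class counts taken from table(df$salary_classification) above
counts  <- c(High = 644, Low = 667, Medium = 1357)
# Inverse-frequency weights: rarer classes get proportionally more weight
weights <- sum(counts) / (length(counts) * counts)
round(weights, 3)  # High 1.381, Low 1.333, Medium 0.655
```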